Practical applications of stand-off annotation

Authors

  • Martha Larson
  • Valentin Jijkoun
  • Jobst Löffler
  • Erik Tjong Kim Sang
Abstract

An information system that makes use of stand-off annotation stores metadata separately from the data they describe. System architectures separate metadata from data in order to cope with heterogeneous annotations or with multimedia formats. This paper discusses some of the practical aspects of implementing an information system with a stand-off architecture. Two systems that use stand-off annotations are described. The first is a prototype radio archive that provides users with content-based access to archived radio broadcasts. This system uses stand-off annotation to store structural metadata describing the broadcasts, which are used for interactive presentation, as well as speech recognition transcripts, which are used for search. The second system is a question answering system that searches a large text corpus in order to identify spans of text that provide answers to user questions. This system uses stand-off annotation to store metadata generated by a series of different linguistic analysis tools. The final section of the paper treats practical aspects of implementing a retrieval system for a diachronic language corpus. Similarities and differences with the prototype radio archive and the question answering system are discussed.

The architecture of many information systems is designed to support a fundamental division between source data and the metadata describing the source data. In the case of multimedia systems, this division is necessary because annotation of multimedia content cannot be represented in the same format as the multimedia essence. In the case of systems handling character data, the separation of data and data description makes it possible to combine many sources of annotation. The combination of multiple varieties of annotation is known as multidimensional markup. Any annotation that is stored separately from the original content can be referred to as “stand-off annotation.” Stand-off annotation is often, though not necessarily, encoded as XML. Even though a separation between data and metadata is convenient and often necessary, systems using stand-off annotation face particular challenges when it comes to providing a mechanism for searching their contents.
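As a minimal illustration of stand-off annotation over character data, the sketch below keeps a source text untouched and stores two annotation layers separately, anchored by character offsets. The sample sentence, the layer names, the labels and the helper functions are all hypothetical and are not taken from either system described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # which tool produced the annotation (hypothetical layer names)
    start: int   # character offset into the source text, inclusive
    end: int     # character offset into the source text, exclusive
    label: str


# The source text is never modified; the sentence is an invented example.
TEXT = "Deutsche Welle broadcasts from Bonn."

# Two independent layers that refer to the same text ("multidimensional markup").
ANNOTATIONS = [
    Annotation("tokens", 0, 8, "token"),
    Annotation("tokens", 9, 14, "token"),
    Annotation("entities", 0, 14, "ORGANIZATION"),
    Annotation("entities", 31, 35, "LOCATION"),
]


def covered_text(text, ann):
    """Resolve an annotation back to the span of text it describes."""
    return text[ann.start:ann.end]


def overlapping(annotations, start, end):
    """Return every annotation, from any layer, that overlaps [start, end)."""
    return [a for a in annotations if a.start < end and start < a.end]
```

Because every layer points into the same offset space, a query can combine layers without either layer knowing about the other, for example by asking which entity annotations overlap a given token.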
Such systems must provide solutions that maintain the synchronization between data and data description and the alignment among the various sources and levels of description.

This paper discusses some of the practical aspects of addressing the challenges of implementing an information system based on an architecture that makes use of stand-off annotation. The paper has three parts. In the first two sections, two information systems implemented using stand-off annotations are described: a prototype radio archive and a question answering (QA) system. The systems were designed to provide users with access to two different kinds of information, but both exhibit the principles of architecture and the query mechanisms necessary to put stand-off annotation to practical use. The third section of the paper discusses the issues involved in using stand-off annotation to encode metadata describing a diachronic language corpus. This section focuses, in particular, on some of the practical challenges that can be expected to arise in the design and implementation of an appropriate architecture and the accompanying retrieval mechanism.

1 Prototype radio archive

The prototype radio archive was developed by the Fraunhofer Institute for Intelligent Analysis and Information Systems and was commissioned by Deutsche Welle and Westdeutscher Rundfunk. The system provides content-based access to radio broadcasts, which are stored internally in .mp3 format. These radio broadcasts are annotated with both formal metadata and content metadata. The formal metadata comprise items that can be characterized as bibliographic or production data, including archive ID, title and date of broadcast. The content metadata are transcripts of the spoken content of these radio shows, which are generated by automatic audio processing techniques, including automatic speech recognition.
Because speech retrieval makes use of speech recognition transcripts, there is significant similarity between the architecture of the prototype radio archive and the architecture of a text retrieval system.

The system architecture of the prototype radio archive is illustrated in Figure 1. The fundamental separation of the audio essence and the metadata that describe it is clearly evident. The lower left-hand corner shows the audio database containing the audio essence. On the right side, the metadata are stored in two separate systems: some are stored in an XML database in MPEG-7 format and some are stored in a file system. A fundamental principle behind this architecture is that the audio essence and the metadata are associated by means of time codes that reflect the offset with respect to the beginning of the audio file. An interesting characteristic of this system is that there are two different types of metadata, MPEG-7 metadata and syllable transcript metadata, and that these are encoded in two different formats. Again, these metadata are synchronized using time code offsets from the beginning of the audio file. In the following discussion, each sort of metadata is discussed in turn, and then the process used for querying the system is described.

Figure 1: Architecture of the Prototype Radio Archive
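The role of time codes as the common reference between metadata layers can be sketched as follows. This is only an illustration under stated assumptions: the segment labels, the syllable units, the timings and the functions `segment_at` and `find_units` are invented for the example, and plain Python objects stand in for the MPEG-7 and syllable transcript formats used by the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedSpan:
    start: float  # seconds from the beginning of the audio file
    end: float    # seconds from the beginning of the audio file
    value: str


# Hypothetical structural metadata (the archive stores this layer as MPEG-7).
segments = [
    TimedSpan(0.0, 12.5, "jingle"),
    TimedSpan(12.5, 95.0, "news report"),
]

# Hypothetical syllable transcript (the archive stores this layer in a file system).
transcript = [
    TimedSpan(13.0, 13.4, "gu"),
    TimedSpan(13.4, 13.8, "ten"),
    TimedSpan(13.8, 14.3, "tag"),
]


def segment_at(spans, t):
    """Return the structural segment that contains time offset t, if any."""
    for span in spans:
        if span.start <= t < span.end:
            return span
    return None


def find_units(spans, units):
    """Return the start offsets at which the given unit sequence occurs."""
    values = [s.value for s in spans]
    n = len(units)
    return [
        spans[i].start
        for i in range(len(values) - n + 1)
        if values[i:i + n] == units
    ]


# A transcript hit is mapped to a structural segment purely through its time code.
hits = find_units(transcript, ["gu", "ten"])
hit_segments = [segment_at(segments, t).value for t in hits]
```

Neither layer refers to the other directly. The shared time axis of the audio file is the only link, which is what allows the two layers to live in different stores and formats.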

Similar resources

A Tool for Feature-Structure Stand-Off-Annotation on Transcriptions of Spoken Discourse

This paper presents an annotation tool and format for the stand-off annotation of transcriptions of spoken discourse such as those produced in a conversation analysis or pragmatic framework. It was developed at the Collaborative Research Centre on Multilingualism in Hamburg, where many suchlike corpora from different research projects exist. It transfers findings from the field of the so-called “...


Stand-off TEI Annotation: the Case of the National Corpus of Polish

We present the annotation architecture of the National Corpus of Polish and discuss problems identified in the TEI stand-off annotation system, which, in its current version, is still very much unfinished and untested, due to both technical reasons (lack of tools implementing the TEI-defined XPointer schemes) and certain problems concerning data representation. We concentrate on two features tha...


Annotation in Architecture: A Systematic Approach toward Mobilization and Development of Theoretical, Research, and Critical Basis in Architecture

Annotations usually refer to marginal notes that explain a difficult or ambiguous subject, provide a general definition or a critical remark for a particular part of a text. Historically, annotating was a well-known tradition in Islamic sciences and was used especially in times when there were less new potentials for generating new knowledge. The main question of this research is, can the tradi...


MAE and MAI: Lightweight Annotation and Adjudication Tools

MAE and MAI are lightweight annotation and adjudication tools for corpus creation. DTDs are used to define the annotation tags and attributes, including extent tags, link tags, and non-consuming tags. Both programs are written in Java and use a stand-alone SQLite database for storage and retrieval of annotation data. Output is in stand-off XML.


Efficient Queries of Stand-off Annotations for Natural Language Processing on Electronic Medical Records

In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing on narrative clinical notes in electronic medical records (EMRs) and efficiently ...



Publication date: 2007